Moving IPNI Queries to Spark Checkers
GitHub issue: https://github.com/filecoin-station/spark/issues/40
Introduction
In the long term, we want Spark checkers running in Filecoin Station to sample all active Filecoin deals to pick a retrieval check to perform.
Currently, we have a manual process to scan the active deals and resolve them into task templates defined as `(cid, address, protocol)` and stored in our Postgres DB. This has several downsides:
- The process is rather inefficient and consumes many resources, as we have to make one IPNI query for every FIL+ LDN deal. Since there are ~8.7m deals and our network checks ~4k CIDs per hour, it takes ~90 days for our network to test all task templates. We run the manual process every month or two, and so many IPNI queries are wasted because we never use their responses.
- We cannot measure how the rate of deals advertised to IPNI evolves over time. Our “retrieval success rate” is calculated only for CIDs advertised to IPNI.
- Because there are very few deals that advertise HTTP transport to IPNI, and because Spark checkers currently cannot check deals that are not advertised to IPNI, we have to include Graphsync protocol in the retrievals we test. Otherwise, we would have an extremely small sample set and put too much load on the few SPs with HTTP retrievals correctly configured. (We already experienced this in December '23.) Graphsync is not well supported; we shouldn’t be testing it.
Proposal
- Let Spark checkers query IPNI to resolve a `cid` from a FIL+ LDN deal into the `(cid, provider_address, protocol)` task.
- Test retrievals using the HTTP protocol only; drop Graphsync.
- Introduce a new Measurement field called `indexerResult`. This field will store the result of the IPNI query and allow us to introduce two new retrieval check results: “not advertised at all” and “http not advertised”.
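For illustration, a measurement submitted with the new field might look like the JSON below. The field values are hypothetical; the exact vocabulary of `indexerResult` codes is still to be defined.

```json
{
  "cid": "bafybeibhsqlh4phj3r3seetqvbq4xebwzz4tvkpckrc4icoeztdxnipbam",
  "providerAddress": "/dns4/sp.example/tcp/443/https",
  "protocol": "http",
  "indexerResult": "OK"
}
```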
Step 1: spark checkers
- Rework the task-picking code to take only the `cid` field from the task definition picked from `retrievalTasks` and ignore the `providerAddress` and `protocol` fields.
- Rework the code executing a task to start with an IPNI query to resolve the CID into the provider address. An IPNI query is a simple HTTP GET request with the CID provided in the request path; IPNI returns 404 when no advertisements are found. Example URL: https://cid.contact/cid/bafybeibhsqlh4phj3r3seetqvbq4xebwzz4tvkpckrc4icoeztdxnipbam
- No advertisement found → report result “not advertised at all”
- No advertisement offering HTTP protocol → report result “http not advertised”
- Pick the first advertisement for the HTTP protocol
In the future, we will look for an advertisement matching the FIL deal we are testing.
- Pick the first address provided in this advertisement
- Hardcode the protocol to `http`
- When submitting the result of the retrieval check (a new measurement) to SPARK API, include a new field in the request body: `indexerResult`.
In the future, we will rework the check to drop Lassie and fetch directly over HTTP(s), but let’s do this one step at a time.
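The resolution logic described above can be sketched as a small helper. This is a hypothetical illustration, not the actual Spark checker code: the function name, the `indexerResult` codes, and the transport detection are all assumptions. The cid.contact response nests providers under `MultihashResults[].ProviderResults`; a full implementation would decode the advertisement `Metadata` protocol codes, while this sketch approximates HTTP detection by looking at the multiaddr suffix.

```javascript
// Sketch: turn a cid.contact response body into the task details the
// checker needs. Names and result codes are illustrative placeholders.
function parseIndexerResponse (body) {
  // cid.contact nests providers under MultihashResults[].ProviderResults
  const providers = (body.MultihashResults ?? [])
    .flatMap(r => r.ProviderResults ?? [])

  if (providers.length === 0) {
    // No advertisement found → report “not advertised at all”
    return { indexerResult: 'NOT_ADVERTISED' }
  }

  // Pick the first advertisement offering the HTTP transport. A real
  // implementation would decode the Metadata protocol codes; here we
  // approximate by looking for an /http(s) multiaddr.
  for (const p of providers) {
    const addr = (p.Provider?.Addrs ?? [])
      .find(a => a.endsWith('/http') || a.endsWith('/https'))
    if (addr) {
      return {
        indexerResult: 'OK',
        providerAddress: addr,
        protocol: 'http' // hardcoded per the proposal
      }
    }
  }

  // Advertisements exist, but none offers HTTP → “http not advertised”
  return { indexerResult: 'HTTP_NOT_ADVERTISED' }
}
```

The HTTP 404 case (“no advertisements found”) would be handled before parsing, by inspecting the response status code.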
Step 2: spark-evaluate (fraud detection)
When checking whether a measurement is for a valid task in the evaluated round, compare only the `cid` field. Ignore `provider_address` and `protocol`.
After we implement honest-majority committees, we should add cross-checks for IPNI resolution. Add the following tasks to the relevant GH issue (https://github.com/filecoin-station/roadmap/issues/59):
- All measurements in the majority should report the same `indexerResult`
- All measurements in the majority should report the same `provider_address` and `protocol`
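The relaxed task validation could look roughly like this. It's a sketch only; the function name and data shapes are assumptions about spark-evaluate's internals, not the actual code.

```javascript
// Sketch: validate a measurement against the tasks of the evaluated round.
// Only `cid` is compared; providerAddress and protocol are now resolved by
// the checker via IPNI, so they are no longer part of the task definition.
function isValidTask (measurement, roundTasks) {
  return roundTasks.some(task => task.cid === measurement.cid)
}
```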
Step 3: indexer result
- In spark-api, modify the data model for measurements to include a new optional field: `indexerResult`. Modify the endpoint recording measurements to store this new field in the DB.
- In spark-publish, include `indexerResult` in the data committed to IPFS.
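On the spark-api side, the migration could be as small as adding a nullable column. The column name is an assumption; the production schema may differ.

```sql
-- Sketch: measurements from old clients won't include the field,
-- so the column must be nullable.
ALTER TABLE measurements
  ADD COLUMN indexer_result TEXT;
```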
Step 4: spark-api & retrieval tasks
We will take the easy path and make a breaking change to get this done quickly. I have verified that existing Spark clients will handle the situation gracefully.
- Introduce a new SQL table with a list of FIL+ LDN deals we want to test for retrievability. This table will be similar to `retrieval_templates` and have the following columns:
  - cid (primary key; extracted from the deal Label)
  - expires_at (timestamp with tz, calculated from the deal proposal’s EndEpoch field)
In the future, we will need metadata to match IPNI responses to this particular deal. We don’t have that implemented yet, so let's keep the db schema simple and the table size small for now.
I considered adding `id` as a primary key, but we don’t use it anywhere. We can rework the primary key to include multiple columns after we add more columns. The benefit of the proposed scheme is that we will naturally enforce the uniqueness of templates. (The rows in `retrieval_templates` are not unique.)
- Modify the scripts in fil-deal-ingester to generate SQL statements for populating this table using deals extracted from the on-chain actor state.
We must ingest FIL+ LDN deals in two steps:
- First, as part of DB schema migration described above, we need to seed the new table with ~1000 deals to pick from. This will give us working data in the local DB used, e.g. for running the tests.
- In the next step, we will ingest all ~9m FIL+ LDN deals by making SQL queries directly against the production DB.
- Deploy the changes above before proceeding with the next steps.
- Rework the code defining retrieval tasks for a new Meridian round to pick CIDs from the list of FIL+ LDN deals. Set the columns `provider_address` and `protocol` to `NULL`.
- Verify that the response of `GET /rounds/meridian/{contract}/{round-index}` converts `NULL` values in `provider_address` and `protocol` to undefined values in JavaScript, therefore excluding these fields from the JSON response. The goal here is to preserve the provider address & protocol in the responses for older rounds, where the retrieval tasks were defined with the provider address and protocol to use.
- Drop the table `retrieval_templates`
- When recording a new measurement, check the Spark version before validating the measurement, so that outdated clients receive the “outdated client” error instead of a validation error.
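A minimal sketch of the new deals table described above. The table name `eligible_deals` is a placeholder (the proposal doesn't name the table); only the two proposed columns are included for now, keeping the schema simple until we need metadata for matching IPNI responses to deals.

```sql
CREATE TABLE eligible_deals (      -- table name is a placeholder
  cid TEXT PRIMARY KEY,            -- extracted from the deal Label;
                                   -- the PK naturally enforces uniqueness
  expires_at TIMESTAMPTZ NOT NULL  -- calculated from the proposal's EndEpoch
);
```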
Migration path
It takes a while to upgrade all running Spark checkers to a new version. In the meantime, we need old nodes to behave gracefully. Ideally, they should tell the user to upgrade.
My proposal above does exactly that, as I verified by running a tweaked checker node that deletes providerAddress and protocol fields from task details.
- The checker node picks a random task
- It asks Lassie to retrieve the given CID using `protocols=undefined&providers=undefined`
- Lassie returns a 400 error
- The Spark checker submits a measurement with missing `providerAddress` & `protocol`
- Spark API rejects the request. We will modify SPARK API to return the “outdated client” error.
Full checker log: https://gist.github.com/bajtos/83fbee4c2e3af16bfcaf9dd38d5cd4b4
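The “outdated client” rejection on the spark-api side could be sketched as follows. The minimum version, function name, and error shape are all assumptions for illustration, not the actual implementation.

```javascript
// Sketch: reject measurements from checkers too old to perform the IPNI
// query themselves. The version threshold is a hypothetical placeholder.
const MIN_SPARK_VERSION = '1.9.0'

function checkClientVersion (sparkVersion) {
  // Compare dotted versions numerically: major, then minor, then patch.
  const toParts = v => String(v ?? '0.0.0').split('.').map(Number)
  const [maj, min, pat] = toParts(sparkVersion)
  const [rMaj, rMin, rPat] = toParts(MIN_SPARK_VERSION)
  const upToDate =
    maj > rMaj ||
    (maj === rMaj && (min > rMin || (min === rMin && pat >= rPat)))
  return upToDate
    ? { ok: true }
    : { ok: false, error: 'OUTDATED CLIENT' }
}
```

This check runs before measurement validation, so old clients that omit `providerAddress` and `protocol` get the upgrade prompt rather than a confusing validation failure.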